Power  Analysis: 

A  Statistical  Tool  for  Assessing 

the  Utility  of  a  Study 


March  1997 


Ontario 


Ministry  of  the 
Environment 


Prepared  by: 

Keith  M.  Somers 

Dorset  Environmental  Science  Centre 

Aquatic  Science  Section 

Ontario  Ministry  of  the  Environment 

P.O.  Box  39 

Dorset, Ontario P0A 1E0


This technical publication
is available in English only.

Copyright:  Queen's  Printer  for  Ontario,  1997 

This  publication  may  be  reproduced  for  non-commercial 
purposes  with  appropriate  attribution. 

Printed on 50% recycled paper
including  10%  post-consumer  fibre 

ISBN  0-7778-6958-6 
PIBS  3544E 


EXECUTIVE  SUMMARY 

Power  analysis  involves  a  group  of  statistical  techniques  that  are  used  to  assess  the 
ability  of  a  study  to  detect  a  significant  difference  when  it  truly  exists.  In  this  context, 
the  discovery  of  a  statistically  significant  difference  is  generally  used  as  an  indication  of 
a noteworthy event. In traditional hypothesis testing, we propose a null hypothesis (H0)
and at least one alternate hypothesis (Ha). Statistical significance is used as a criterion
for  rejecting  the  null  hypothesis  and  accepting  the  alternate  hypothesis.  However, 
statistical  significance  depends  on  a  number  of  factors  including  the  size  of  the  event 
(i.e.,  the  effect  size),  the  number  of  samples  and  the  probability  of  a  Type  I  error  (i.e., 
the  probability  of  incorrectly  rejecting  a  true  null  hypothesis).  In  power  analysis,  these 
same  factors  are  used  to  estimate  the  likelihood  of  correctly  rejecting  a  false  null 
hypothesis.    If  a  study  has  limited  power,  then  it  has  a  small  probability  of  detecting  a 
significant  event.  Through  the  use  of  power  analysis,  we  can  identify  studies  with 
limited  power.  These  studies  can  be  modified  to  increase  their  power  to  ensure  that 
they  achieve  their  purpose.  In  addition,  the  results  of  a  power  analysis  can  be  used 
with  additional  information  on  the  costs  of  each  step  in  the  study  in  order  to  optimize 
study  designs  to  provide  the  greatest  power  for  the  least  cost.  This  report  introduces 
the  concept  of  power  analysis,  but  it  does  not  address  study  optimization.  Worked 
examples  are  provided  to  illustrate  power  calculations  for  a  Student's  t  test,  a  one-way 
analysis  of  variance  and  a  simple  linear  regression.  Practical  applications  of  power 
analysis  (i.e.,  in  decision  making,  selecting  indicators  and  assessing  the  minimum 
detectable  change)  are  also  discussed. 


INTRODUCTION

Power  analysis  refers  to  a  family  of  statistical  techniques  that  assess  the  ability  of  a 
study  to  detect  a  significant  difference  when  such  a  difference  truly  exists.  Recent 
interest  in  power  analysis  has  been  fuelled  by  questions  surrounding  the  ecological 
relevance  of  a  statistically  significant  difference  (e.g.,  Green  1989,  McBride  et  al.  1993, 
Osenberg  et  al.  1994),  plus  a  growing  desire  to  make  management  decisions  that  err  in 
favour  of  (or  protect)  the  environment  (i.e.,  the  Precautionary  Principle;  Peterman  1990, 
Peterman  and  M'Gonigle  1992,  Mapstone  1995).   In  addition,  the  results  of  a  power 
analysis  can  be  used  with  other  information  and  costs  to  optimize  study  designs  to 
provide  the  greatest  power  for  the  least  cost  (e.g.,  Clarke  and  Green  1988,  Ferraro  et 
al.  1989). 

This  document  examines  the  concept  of  power  analysis,  but  it  does  not  address  study 
optimization.  This  report  has  two  goals:  (1)  to  introduce  the  theory  underlying  power 
analysis;  and  (2),  to  provide  worked  examples  of  simple  power  calculations.  These 
examples  include  a  Student's  t  test,  a  one-way  analysis  of  variance  and  a  linear 
regression.  Further  details,  formulas  and  additional  examples  are  found  in  the  textbook 
entitled  "Biostatistical  Analysis"  by  J.H.  Zar  (1984).  This  book  was  the  source  of  the 
formulas  reported  here. 


BACKGROUND 

Most of us are familiar with the concept of hypothesis testing. That is, a null hypothesis
of no difference among several means is proposed (i.e., H0) and a statistical test is
used to accept or reject H0. If this null hypothesis is rejected, then we assume that the
alternate hypothesis is true (i.e., Ha: at least one significant difference is observed).
Typically we reject H0 when the probability of observing a result of this magnitude (i.e.,
alpha) is less than some critical value (e.g., P<0.05; see Figure 1a). This P value is the
probability of observing a difference of the same magnitude or greater simply by chance
when H0 is true (see Table 1). As a result, this P value is the probability of erroneously
rejecting H0 (called a Type I error). The complement of alpha (i.e., 1 - alpha) is the
probability of correctly accepting the null hypothesis when it is true (e.g., we are 95%
confident that the observed difference is not due to chance).

There are two types of errors associated with statistical tests (see Table 1): the Type I
error described above, and a Type II error. If we define a Type I error as a false
positive, then a Type II error is a false negative. In this context, the probability
associated with a Type II error (called beta) is the likelihood of incorrectly accepting H0.
A  false  negative  of  this  sort  may  be  quite  costly  in  environmental  studies  since  the 
failure  to  detect  an  emerging  problem  may  lead  to  expensive  clean-up  at  some  later 
date  (Peterman  1990,  Fairweather  1991). 

Unlike  a  Type  I  error,  the  probability  of  a  Type  II  error  is  rarely  calculated  (Toft  and 
Shea  1983).  Hypothesis  testing  focuses  on  the  likelihood  of  incorrectly  concluding  that 
a difference was observed (i.e., alpha), not the probability of failing to detect a true
difference (i.e., beta; see Rotenberry and Wiens 1985, Gerrodette 1987). The power of
a test is the complement of beta (i.e., 1 - beta; see Table 1 and Figure 1b). This
complement is the probability of detecting a difference when it truly exists (i.e., to
accept Ha when it is true). Thus power provides a quantitative estimate of the ability of
a  study  to  detect  a  significant  difference. 

The relationship between alpha and beta is readily illustrated (Figure 1c; and also see
Peterman 1990, McBride et al. 1993). In order to reject the null hypothesis of no
difference among several means, we compare an observed F value to the appropriate
statistical tables. If the observed F is larger than the tabulated F value (at say P=0.05),
then we reject the null hypothesis of no difference. Tabulated F values are based on
the expected distribution of F values if H0 is true (i.e., a central F distribution based on
the expected Probability Density Function of F; see Figure 1a). The shape of this
distribution is a function of the degrees of freedom. Alpha is the proportion of the
distribution (i.e., the area under the curve) that is larger than the critical F value (e.g.,
5% of the possible F values are more extreme than the tabulated F value).

For every null hypothesis there is at least one alternate hypothesis. Just like H0, each
alternate hypothesis has an expected distribution assuming that Ha is true (e.g., a
non-central F distribution based on a Probability Density Function assuming the
hypothesized effect size; see Figure 1b). The shape of this distribution is determined
by the degrees of freedom plus a non-centrality parameter. This non-centrality
parameter is a function of the hypothesized effect size (i.e., the critical difference
among the means) which shifts the Ha distribution away from zero (hence the name,
non-central). As the hypothesized effect size is increased, the non-central F distribution
is shifted further away from zero (e.g., see Peterman 1990). As a result, the amount of
overlap between the distributions for H0 and Ha decreases as the effect size increases
(see Figure 1c).

When we test a null hypothesis we subjectively choose a P value for alpha (e.g., the
probability of a Type I error is P<0.05; see Figure 1a). This P value defines a critical F
value that subdivides the expected distribution for H0 (i.e., the central F) into two
segments. One segment represents an area under the curve of 1 - alpha, whereas the
other segment comprises an area of alpha. In addition, this critical F value cuts through
the expected distribution for Ha, thereby dividing the non-central F into two segments
with areas of beta and 1 - beta, respectively (see Figure 1b and 1c). Thus alpha and beta
(and hence, the Type I and Type II errors) are linked by the shape and overlap of the
expected distributions for H0 and Ha. Moreover, for a given effect size and sample size,
our choice of alpha also specifies beta.
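
The link between alpha and beta can be sketched numerically. The example below is a simplified illustration, not the F-based calculation described above: it swaps in a one-tailed test on a standard normal statistic, where the alternate distribution is simply the null distribution shifted by a hypothesized effect expressed in standard-error units. The function name and the shift of 2.5 are assumptions chosen for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed null distribution of the test statistic


def power_one_tailed_z(shift, alpha):
    """Power of a one-tailed z test when the alternate distribution is
    the null distribution shifted upward by `shift` standard errors."""
    critical = STD_NORMAL.inv_cdf(1.0 - alpha)  # cut-off fixed by alpha
    beta = STD_NORMAL.cdf(critical - shift)     # alternate area below the cut-off
    return 1.0 - beta


# Hypothetical effect of 2.5 standard errors, evaluated at two alphas.
for alpha in (0.05, 0.10):
    power = power_one_tailed_z(2.5, alpha)
    print(f"alpha={alpha:.2f}  beta={1.0 - power:.3f}  power={power:.3f}")
```

With no shift the two distributions coincide and the power equals alpha. Raising alpha moves the cut-off toward zero, which reduces beta for the same effect size.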

Generally  beta  and  the  power  of  a  test  are  calculated  after  the  data  have  been  collected 
(Toft  and  Shea  1983,  Peterman  1990).   In  this  context,  power  is  determined  by  the 
effect  size,  the  sample  size  and  alpha.  Since  the  effect  size  and  alpha  are  assumed  to 
be  fixed,  most  researchers  believe  that  only  the  number  of  samples  can  be  altered  to 
increase the power of a test (Bernstein and Zalinski 1983, Alldredge 1987). However,
at least two other options are available (Underwood 1981, Spooner et al. 1987,
Smeltzer et al. 1989). One option is to decrease the field and laboratory error
associated  with  each  measurement  (i.e.,  increase  the  field  or  laboratory  precision),  and 
thereby  increase  the  relative  differences  between  the  means.  A  second  option  is  to 
increase  alpha  (i.e.,  the  probability  of  a  Type  I  error;  see  Underwood  1981)  from  0.05  to 
0.10,  or  perhaps  an  even  larger  value  (see  Figure  1c). 

Optimizing  Studies  based  on  their  Power  -  Every  study  is  undertaken  with  a 
particular  goal  in  mind.  The  likelihood  of  achieving  this  goal  is  a  function  of  the 
experimental  design  of  a  study.  Field  and  laboratory  protocols,  plus  the  number  of 
treatments  and  replicates  are  all  parts  of  the  experimental  design.   Because  each  step 
in  a  study  has  an  associated  cost,  information  on  the  experimental  design,  the  costs 
and  the  likelihood  of  detecting  a  significant  difference  can  be  balanced  to  ensure  that 
an  impact  of  a  specified  magnitude  is  detected  with  a  given  probability  at  a  minimum 
cost.  Thus,  study  optimization  involves  balancing  the  costs  and  the  statistical  power  of 
an  experimental  design  (e.g.,  Clarke  and  Green  1988,  Ferraro  et  al.  1989). 

Information  from  a  power  analysis  can  be  used  to  optimize  both  new  and  ongoing 
studies.  Considerable  cost  savings  are  possible  if  we  know  the  appropriate  sample 
sizes,  the  effect  size  and  the  power  of  the  test.  Too  few  samples  will  limit  the  power  of 
a  study,  possibly  requiring  a  follow-up  to  confirm  the  initial  results  (Spooner  et  al.  1987, 
Smeltzer  et  al.  1989),  or  the  collection  of  several  more  years  of  data  before  an  impact  is 
detected  (Trautmann  et  al.  1982,  Gerrodette  1987).  If  a  decision  is  delayed  because  of 
insufficient  power,  we  must  also  consider  the  costs  associated  with  additional  clean-up 
because of this delay (Peterman 1990, Fairweather 1991). By contrast, too many
samples  may  be  an  unnecessary  waste  of  resources  resulting  in  statistically  significant 
results  that  have  little  ecological  significance.  Similarly,  ultra-sensitive  laboratory 
analyses  may  be  unwarranted  if  the  increased  precision  is  swamped  by  uncontrolled 
spatial  and  temporal  variability  in  the  field. 

The power of a study is a function of the experimental design (Bernstein and Zalinski
1983, Green 1989). For example, sample size (i.e., the number of replicates) affects
the power of a study because larger numbers of samples will allow us to detect smaller
differences between several means (Alldredge 1987, McBride et al. 1993). In this
context, the size of the standardized difference is called the effect size (e.g., see Figure
1b). Increasing the number of samples does not change the effect size. However, a
larger  number  of  samples  increases  our  confidence  in  estimating  a  mean.  This 
increased  confidence  translates  into  a  smaller  error  term,  which  in  turn  enables  us  to 
detect  a  smaller  difference  between  means.  As  a  result,  the  number  of  samples,  the 
effect  size  and  the  error  (or  variability)  associated  with  each  mean  will  affect  the  power 
of  a  study. 

The  power  of  a  test  provides  a  quantitative  estimate  of  the  likelihood  of  detecting  an 
effect  of  a  specified  size  (Table  1).  If  a  study  has  relatively  low  power,  the  experimental 
design  can  be  changed,  perhaps  by  increasing  the  number  of  samples  or  by  controlling 
for extraneous factors (e.g., Bernstein and Zalinski 1983). Alternatively, the effect size
can be increased by refining field and laboratory protocols to reduce the error
associated with sampling and analytical variability (Clarke and Green 1988, Smeltzer et
al.  1989).  In  this  way,  the  power  analysis  can  be  used  to  ensure  that  a  particular  effect 
size  can  be  detected  a  specified  proportion  of  the  time. 

Many  books  on  experimental  design  emphasize  the  importance  of  collecting  preliminary 
data  (e.g.,  Green  1979).  These  data  provide  estimates  of  the  anticipated  effect  size 
and  the  error.  Using  this  information,  we  can  estimate  the  appropriate  number  of 
samples  that  are  required  to  detect  a  difference  of  a  specified  magnitude  (e.g., 
Alldredge  1987,  Green  1989).  Moreover,  we  can  re-arrange  these  formulas  to  reveal 
the  size  of  a  difference  that  a  particular  experimental  design  will  detect  (i.e.,  the  effect 
size  or  minimum  detectable  change;  Zar  1984,  Spooner  et  al.  1987).   In  addition,  we 
can  calculate  the  probability  of  detecting  a  difference  of  a  specified  size  when  it  truly 
exists. 
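
The two rearrangements described above (solving for the number of samples, or for the detectable difference) can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a normal approximation for a one-tailed comparison of two means with equal replication, whereas textbook t-based formulas require iteration and give slightly larger sample sizes. The function names and the example variance and difference of 1.0 are hypothetical.

```python
import math
from statistics import NormalDist

_z = NormalDist().inv_cdf  # standard normal quantile function


def samples_per_group(variance, difference, alpha=0.05, power=0.80):
    """Approximate replicates per group needed to detect `difference`
    between two means, given a preliminary estimate of the variance."""
    z_sum = _z(1.0 - alpha) + _z(power)
    return math.ceil(2.0 * variance * z_sum ** 2 / difference ** 2)


def detectable_difference(variance, n, alpha=0.05, power=0.80):
    """The same relationship solved for the difference instead of n."""
    z_sum = _z(1.0 - alpha) + _z(power)
    return math.sqrt(2.0 * variance / n) * z_sum


n_needed = samples_per_group(variance=1.0, difference=1.0)
smallest = detectable_difference(variance=1.0, n=n_needed)
print(n_needed, round(smallest, 3))
```

Because the required sample size scales with the inverse square of the difference, halving the difference of interest roughly quadruples the number of replicates.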

For  new  studies,  we  require  preliminary  data  to  utilize  power  analysis  in  the  planning 
stages  (e.g.,  Green  1979,  Clarke  and  Green  1988).  Based  on  these  analyses  we  can 
increase  the  probability  that  a  new  program  will  achieve  its  goal.  By  contrast,  a  power 
analysis  of  an  ongoing  study  will  evaluate  the  adequacy  of  that  study  (e.g.,  Smeltzer  et 
al.  1989,  Fryer  and  Nicholson  1993).  Ongoing  studies  with  limited  power  can  be 
identified  and  the  weaknesses  corrected.  Alternatively,  studies  that  will  detect  trivial 
differences  can  be  adjusted  accordingly.  As  such,  power  analysis  allows  us  to  optimize 
studies  such  that  costs  are  minimized  and  the  likelihood  of  success  is  maximized. 


Unresolved Issues - Every study should have sufficient power to detect a significant
difference  in  order  to  protect  the  environment  (Underwood  1981,  Fairweather  1991). 
Moreover,  every  study  should  also  have  a  relatively  small  probability  of  falsely  indicating 
a  difference  when  no  true  difference  exists  (i.e.,  alpha).  Unfortunately,  an  objective 
criterion  for  acceptable  power  does  not  exist  (Peterman  1990).  Often  an  arbitrary 
minimum such as 80% power is proposed (e.g., Green 1989). Alternatively, Toft and
Shea  (1983)  and  Rotenberry  and  Wiens  (1985)  suggest  that  beta  should  be  less  than 
or  equal  to  alpha  (i.e.,  studies  should  have  at  least  95%  power).  Similarly,  Peterman 
and  M'Gonigle  (1992)  emphasize  that  beta  must  be  less  than  alpha  if  we  intend  to 
adopt a Precautionary Principle in order to protect the environment. Regardless, any
decision  regarding  what  constitutes  acceptable  power  must  ultimately  weigh  the  costs 
of  remediation  (i.e.,  clean-up  due  to  a  Type  II  error)  relative  to  the  costs  of  a  false  alarm 
(i.e.,  a  Type  I  error).  In  environmental  situations,  the  costs  of  remediation  generally 
outweigh  the  costs  of  a  false  alarm  (Fairweather  1991,  Osenberg  et  al.  1994). 
Consequently,  power  values  of  at  least  95%  seem  to  be  prudent. 

A  second  weakness  in  power  analysis  surrounds  the  definition  of  a  significant  effect 
(Rotenberry  and  Wiens  1985,  McBride  et  al.  1993,  Mapstone  1995).  Statistical 
significance  is  the  likelihood  of  observing  a  difference  of  the  same  size  or  larger  simply 
by  chance.  By  contrast,  ecological  significance  is  not  easily  defined  (Stewart-Oaten  et 
al.  1992,  Underwood  1992).  Statistical  significance  is  a  function  of  the  observed  effect 
size,  the  sample  size  and  alpha.  As  a  result,  relatively  small  differences  can  be 
statistically significant if a large sample size is used (McBride et al. 1993). Thus, the
relevance of a statistically significant effect can be disputed (Jones and Matloff 1986),
since trivially small differences can be statistically significant.

The concept of ecological significance forces us to re-express an observed difference in
an ecologically meaningful context (Kersting 1984, Hughes 1995). As a result, a
difference must be expressed as a relative value. For example, a 10% change in the
average abundance of a population may be ecologically unimportant if the normal
range of year-to-year variation is ±20%. However, an annual 10% decline in
abundance  has  grave  implications  if  that  decline  continues  for  10  years  because  the 
population  will  disappear. 

To  translate  ecological  significance  into  an  appropriate  context  for  power  analysis, 
ecological significance should be the criterion for constructing the alternate hypothesis
(i.e., Ha). Thus, Ha should propose a critical difference that is hypothesized to be
ecologically significant (e.g., Underwood 1992, Osenberg et al. 1994), instead of simply
specifying  the  criterion  of  a  statistically  significant  difference  from  zero.   In  this  context, 
constructing  appropriate  alternate  hypotheses  will  require  a  thorough  understanding  of 
background  or  normal  conditions.  Unfortunately,  information  on  background  conditions 
is  generally  lacking  for  environmental  issues,  and  a  single  "control"  or  reference  data 
point is not sufficient (Underwood 1992, Yan et al. 1996). Considerable work is needed
to  resolve  the  question  of  what  constitutes  ecological  significance. 


EXAMPLE  POWER  ANALYSIS  CALCULATIONS 

Comparing  two  means  -  A  Student's  t  test  is  commonly  used  to  compare  two  means 
(or  two  averages).  A  typical  situation  might  involve  multiple  measurements  of  a  single 
parameter  at  two  locations.  For  example,  suppose  that  total  phosphorus  concentration 
(i.e.,  TP)  is  measured  in  a  small  river.  One  sampling  area  is  upstream  of  a  sewage 
treatment  plant  (STP)  and  the  other  is  a  short  distance  downstream  of  the  STP 
discharge.  Multiple  samples  have  been  collected  from  each  location,  and  the  question 
of  interest  is  whether  or  not  the  STP  discharge  affects  the  phosphorus  concentration  of 
the river. Here H0 states that there is no increase in the downstream mean TP. The
alternate hypothesis, Ha, states that a significant increase was observed. Traditionally,
our  Type  I  error  (alpha)  is  set  at  5%  (i.e.,  the  probability  of  incorrectly  concluding  that 
there  is  a  significant  difference  is  P<0.05).  By  contrast,  beta  is  unknown.  As  a  result, 
we  must  calculate  beta  (and  its  complement,  1  -  beta;  see  Table  1)  to  estimate  the 
power  of  our  test. 

Several  different  types  of  t  tests  could  be  used  in  this  scenario  (Zar  1984,  Green  1979, 
1989).  If  a  single  water  sample  was  collected  from  both  areas  on  the  same  day  and 
samples  were  collected  over  a  number  of  different  days,  then  each  upstream  sample 
would  be  paired  with  a  downstream  sample.  As  a  result,  a  paired  t  test  could  be  used 
to  determine  whether  or  not  the  mean  of  the  differences  for  each  day  was  significantly 
different  from  zero.  This  is  a  type  of  repeated-measures  analysis  where  each  pair  of 
samples is distinguished by the date of collection. However, all of the samples were
collected  on  the  same  day  in  our  example  (see  Table  2).  Consequently,  there  is  no 
reason  why  individual  upstream  and  downstream  samples  should  be  paired  or  linked. 
Thus  the  appropriate  test  is  a  simple  t  test  where  the  means  from  each  location  are 
compared.  We  also  use  a  one-tailed  test  because  we  are  only  interested  in  an  increase 
in  total  phosphorus  at  the  downstream  location.  A  lower  phosphorus  concentration  in 
the  river  below  the  STP  would  not  suggest  that  the  STP  is  discharging  too  much 
phosphorus. 

Hypothetically,  ten  water  samples  were  collected  from  each  site  and  analysed  for  total 
phosphorus  concentration  (Table  2).  The  mean  for  the  upstream  location  was  4.34, 
whereas  the  mean  for  the  downstream  area  was  4.82.  The  difference  between  these 
two  means  is  0.48  or  an  increase  of  11.1%  relative  to  the  upstream  concentration.  This 
difference  is  the  observed  effect  size.  The  t  value  for  the  one-tailed  comparison  is 
2.665, which is significantly larger than the tabulated critical t (0.05, one-tailed) of 1.734 (P<0.008).
Thus,  the  total  phosphorus  concentration  below  the  STP  discharge  was  significantly 
greater  than  the  upstream  concentration.  The  observed  probability  of  a  Type  I  error 
was  0.008,  indicating  less  than  one  chance  in  100  comparisons  of  incorrectly  rejecting 
the  null  hypothesis  of  no  increase.  However,  beta  and  the  power  of  the  test  are 
unknown. 

In  order  to  estimate  power,  we  calculate  phi  using  the  appropriate  power  formula  (Zar 
1984,  p.  137:  formula  9.29;  or  see  Table  2).  The  unstandardized  effect  size  called  delta 
is the observed difference of 0.48 between the upstream and downstream mean
phosphorus concentrations. The pooled error associated with estimating each mean
(i.e., s²p) is 0.162, and the sample size at each location (i.e., n) is 10. All of this
information is available from a spreadsheet such as Excel®. The calculated value for
phi is 1.747. Using Figure B.1 in Zar (1984, p. 657) with 1 and 18 degrees of freedom,
the  power  (i.e.,  1  -  beta)  is  estimated  as  0.65.  The  probability  of  committing  a  Type  II 
error  (i.e.,  beta)  is  0.35.  Consequently,  the  likelihood  of  detecting  an  11.1%  increase  in 
total  phosphorus  concentration  is  approximately  two  out  of  every  three  comparisons 
(assuming  that  all  things  are  the  same  in  each  comparison). 
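
The reported t value and relative effect size can be checked from the summary values quoted above (the two means, the pooled variance and n). The sketch below assumes the standard pooled-variance form of the two-sample t statistic with equal sample sizes; the function name is hypothetical and the raw data in Table 2 are not used.

```python
import math


def pooled_t(mean_a, mean_b, pooled_variance, n):
    """Two-sample t statistic for equal n, using a pooled variance."""
    standard_error = math.sqrt(2.0 * pooled_variance / n)
    return (mean_b - mean_a) / standard_error


upstream, downstream = 4.34, 4.82   # mean TP at each location (from the text)
pooled_variance, n = 0.162, 10      # pooled variance and samples per location

delta = downstream - upstream                 # unstandardized effect size
percent_increase = 100.0 * delta / upstream   # relative to the upstream mean
t_observed = pooled_t(upstream, downstream, pooled_variance, n)

print(round(delta, 2), round(percent_increase, 1), round(t_observed, 2))
```

The result is about 2.67, close to the reported 2.665; the small gap presumably reflects rounding of the quoted means.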

Given  this  experimental  design  and  the  associated  variability  around  each  mean  (i.e., 
s²p; see Table 2), we have moderate power to detect an impact. This moderate power
implies that one-third of the time we will incorrectly conclude that there is no impact
when an 11.1% increase is observed. By contrast, we are very unlikely to reject the null
hypothesis  of  no  increase  when  it  is  true  (i.e.,  the  Type  I  error  probability  is  0.008). 

We  can  increase  the  power  of  the  test  by  increasing  the  number  of  samples  from  each 
area (i.e., increasing n), or by reducing the variability about the mean for each area (i.e.,
decreasing s²p). A change in the pooled variance may be achieved by refining the field
or  laboratory  (i.e.,  analytical)  protocols  (e.g.,  Hanna  and  Peters  1991).  However,  the 
decision  regarding  the  best  way  to  optimize  a  study  should  be  based  on  the  relative 
costs  of  additional  samples  versus  the  costs  of  improved  field  or  laboratory  methods 
(e.g.,  Clarke  and  Green  1988,  Ferraro  et  al.  1989). 

We  can  also  increase  our  power  by  using  a  larger  minimum  detectable  change  (or 
effect  size).  A  change  in  the  minimum  detectable  effect  size  shifts  the  non-central 
distribution for Ha further away from zero (see Figure 1). Here we must select an
acceptable power for the study (e.g., 80%, 90% or 95%) and specify the sample size
and estimated variability (see Zar 1984, p. 135: formula 9.25). Based on the scenario
described above (i.e., Table 2), a power of 80% is achieved if the minimum detectable
change is raised to 0.533 or 12.3% (from 11.1%). By contrast, a power of 90% is
obtained with a minimum detectable change of 0.618 or 14.2%, whereas an effect size
of 0.690 or 15.9% can be detected 95% of the time. In this example, acceptable
statistical  power  can  be  obtained  with  relatively  small  increases  in  our  minimum 
detectable  change. 
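
These minimum detectable changes can be reproduced with a short calculation. The sketch below assumes the usual form delta = sqrt(2 s²p / n) x (t_alpha + t_beta), with critical t values for 18 degrees of freedom taken from standard tables. One point is an inference rather than something stated in the text: the reported values are matched when the two-tailed critical value for alpha (2.101) is used, whereas the one-tailed value (1.734) gives smaller changes.

```python
import math

# Critical t values for 18 degrees of freedom (standard table values).
T_ALPHA = 2.101                                   # alpha = 0.05, two-tailed (assumed)
T_BETA = {0.80: 0.862, 0.90: 1.330, 0.95: 1.734}  # one-tailed, beta = 1 - power


def minimum_detectable_change(pooled_variance, n, t_alpha, t_beta):
    """Smallest difference between two means detectable at the chosen power."""
    return math.sqrt(2.0 * pooled_variance / n) * (t_alpha + t_beta)


upstream_mean = 4.34
mdc = {
    power: minimum_detectable_change(0.162, 10, T_ALPHA, t_beta)
    for power, t_beta in T_BETA.items()
}
for power, change in mdc.items():
    percent = 100.0 * change / upstream_mean
    print(f"power={power:.2f}  change={change:.3f}  ({percent:.1f}%)")
```

The three results round to 0.533, 0.618 and 0.690, matching the values quoted in the text.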

This  simple  example  emphasizes  an  inherent  weakness  associated  with  using  power 
calculations  to  optimize  a  study.  Two  arbitrary  decisions  must  be  made.  The  first 
decision  involves  selecting  an  acceptable  level  of  power  to  detect  an  impact.  A  test 
with  80%  power  implies  a  Type  II  error  probability  of  20%.  Consequently,  one  out  of 
every  five  significant  impacts  may  go  undetected.  A  related  issue  surrounds  the 
subjective decision of what constitutes an ecologically significant impact. To date there
is  little  guidance  for  objectively  determining  the  threshold  at  which  a  change  becomes 
unacceptable.  Both  of  these  issues  remain  to  be  resolved. 

Comparing  more  than  two  means  -  A  simple  one-way  analysis  of  variance,  or  single- 
factor  ANOVA,  is  a  common  method  to  determine  whether  or  not  two-or-more  means 
are significantly different. To illustrate this type of power analysis, we compare
biological data modelled after a study of several streams draining into Lake Simcoe
(Table 3). The data comprise the combined counts of mayflies (Ephemeroptera),
stoneflies (Plecoptera) and caddisflies (Trichoptera) in samples of approximately 200
macroinvertebrates collected from five riffles in each stream (i.e., an EPT index of
sorts). Historically, the proportion of these three taxa in a sample of benthic
macroinvertebrates  has  proven  to  be  a  sensitive  index  of  water  quality.  Streams  with 
higher  EPT  values  generally  have  better  water  quality.  Here,  we  propose  the  null 
hypothesis (H0) that the mean EPT indices from each of three streams are the same.
The alternate hypothesis (Ha) states that at least one of the means differs significantly
from  the  others  (P<0.05). 

The  mean  EPT  index  from  each  stream  reveals  that  Stream  A  supports  more  mayflies, 
stoneflies  and  caddisflies  than  the  other  two  streams  (Table  3).  The  mean  index  for 
Stream A is 81.2, whereas the means for streams B and C are 71.4 and 67.0,
respectively.  The  one-way  ANOVA  compares  the  three  means  and  provides  an 
observed  F  value  of  7.585.  By  comparison,  the  tabulated  F  value  with  2  and  12 
degrees  of  freedom  at  P=0.05  is  3.885.  The  observed  F  value  is  larger  than  the 
tabulated value, with an associated probability of 0.007. As a result, we reject H0 and
accept Ha: that at least one of the means is significantly different. In order to identify
which means differ, we might use a follow-up test such as a Scheffe or Student-
Newman-Keuls test (e.g., see Zar 1984).

The ANOVA table from Excel® provides all of the information necessary for the power
analysis  (Table  3).  We  calculate  phi  and  then  estimate  beta  and  the  associated  power 
of  the  test  (i.e.,  from  Zar  1984,  p.  174:  formula  11.27).  The  observed  effect  size  is 
estimated  from  the  variance  among  the  three  means  (i.e.,  the  between-group  mean 
square)  relative  to  the  average  variability  within  a  stream  (i.e.,  the  within-group  or  error 
mean square, much like s²p in the Student's t test). Based on these calculations, phi is
2.095. Using Figure B.1 in Zar (1984: p. 658), the power associated with phi with 2 and
12  degrees  of  freedom  is  approximately  0.80.  Thus,  we  have  an  80%  chance  of 
detecting  a  difference  as  large  as  the  one  observed  here.  This  implies  that  four  out  of 
every  five  comparisons  will  correctly  identify  a  significant  difference  between  the  three 
EPT  means.  Alternatively,  once  in  every  five  comparisons  we  will  mistakenly  conclude 
that  there  is  no  significant  difference  among  the  three  streams. 
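
The reported phi can be recovered from the observed F ratio alone. The sketch below assumes the form phi = sqrt((k - 1)(F - 1) / k), where k is the number of groups and F is the between-group mean square divided by the error mean square. Table 3 is not reproduced here, so this expression is taken to be equivalent to the cited formula only because it returns the reported value; the function name is hypothetical.

```python
import math


def phi_from_f(f_observed, k):
    """phi = sqrt((k - 1)(F - 1) / k), with F = between-group MS / error MS."""
    return math.sqrt((k - 1) * (f_observed - 1.0) / k)


phi_anova = phi_from_f(7.585, k=3)        # three streams, observed F = 7.585
phi_t_test = phi_from_f(2.665 ** 2, k=2)  # two means, with F taken as t squared

print(round(phi_anova, 3), round(phi_t_test, 3))
```

The same expression with k = 2 and F = t² returns 1.747 for the two-sample example, which suggests (though the text does not state it) that both phi values were computed in this way. Converting phi to power still requires the power charts or a non-central F distribution.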

Assessing the power of a linear trend - We often use a simple linear regression to
estimate the relationship between two variables. If the X variable represents some
measure  of  time,  then  we  might  ask  if  a  second  variable  (i.e.,  Y)  changes  significantly 
over  time  (e.g.,  Gerrodette  1987,  Edwards  and  Perkins  1992).  Here  our  example  is 
modelled  after  the  midsummer  catches  of  crayfish  from  a  lake  in  south-central  Ontario 
(Table  4).  Crayfish  were  collected  from  the  same  lake  each  summer  for  eight  years 
(i.e.,  from  1988  to  1995).  The  catch  data  are  expressed  as  the  average  number  of 
crayfish  caught  per  trap  per  night. 

The regression of crayfish catch against sampling year reveals a downward trend (i.e.,
a  slope  of  -0.092;  Table  4).  This  slope  indicates  an  average  decline  in  the  crayfish 
catch  of  0.092  animals  per  trap  per  year.  When  compared  to  the  predicted  relative 
abundance  for  1988,  this  change  implies  a  10.7%  decline  each  year.  Should  this  trend 
continue,  crayfish  relative  abundance  will  fall  to  zero  by  the  tenth  year  of  the  study.  In 
this  example,  we  want  to  estimate  the  power  to  detect  a  change  of  this  magnitude. 
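
The arithmetic behind the annual decline and the time to zero catch can be checked with a short Python sketch. It uses the slope of -0.092 and the predicted 1988 catch of 0.86 quoted in this report. The variable names are illustrative, and the projection simply assumes that the linear trend continues unchanged.

```python
# Values quoted in the report: predicted 1988 catch and the regression slope.
predicted_1988 = 0.86   # crayfish per trap per night
slope = -0.092          # change in catch per year

# Annual decline expressed as a percentage of the predicted 1988 catch.
annual_decline_pct = 100.0 * abs(slope) / predicted_1988

# Years until the fitted line reaches zero, assuming the trend stays linear.
years_to_zero = predicted_1988 / abs(slope)

# Expected catch ten years after 1988 under the same assumption.
catch_after_10 = predicted_1988 + 10 * slope

print(round(annual_decline_pct, 1), round(years_to_zero, 1), round(catch_after_10, 2))
```

The fitted line crosses zero between the ninth and tenth year, which is why the expected catch ten years on is slightly negative.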

There  are  several  methods  to  estimate  the  power  of  a  regression  (Zar  1984:  pp.  283-4). 
One  simple  approach  (Zar  1984,  p.  312;  formula  19.12)  contrasts  the  observed 
correlation  coefficient  with  the  critical  value  from  standard  statistical  tables  (e.g.,  Zar 
1984, pp. 570-1; Table B.16). This test is analogous to assessing the power or ability
to  distinguish  the  observed  slope  from  zero  (e.g.,  Gerrodette  1987,  Edwards  and 
Perkins  1992). 

In  this  example,  the  expected  correlation  from  statistical  tables  is  based  on  a  one-tailed 
test  because  we  are  only  interested  in  our  power  to  detect  a  decline  in  crayfish  relative 
abundance  (see  Table  4).  The  observed  and  tabulated  correlation  coefficients  are 
transformed using Fisher's Z transformation (Zar 1984, p. 573; Table B.17). Then the
square root of the adjusted degrees of freedom is multiplied by the difference between
the transformed correlations. This product is a standard normal deviate (i.e., Zβ(1)) and
the  probability  of  obtaining  a  value  of  this  magnitude  or  larger  is  determined  from  a 
standard  table  of  the  proportions  under  a  normal  curve  (e.g.,  Zar  1984,  p.  483;  Table 
B.2).  This  probability  is  beta,  the  complement  of  the  power  of  the  test. 
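
The steps in this paragraph can be sketched in Python as follows. The sketch rests on several assumptions: the "adjusted degrees of freedom" are taken to be n - 3, as is usual with Fisher's transformation; the observed correlation is recovered from the t statistic of the slope reported in this example (t=-2.394 with n=8 years); and the critical correlation is derived from the standard one-tailed critical t of 1.943 for 6 degrees of freedom at P=0.05, rather than read from Table B.16. Function and variable names are illustrative.

```python
import math
from statistics import NormalDist


def r_from_t(t, df):
    """Convert a t statistic for a slope into the absolute correlation coefficient."""
    return abs(t) / math.sqrt(t * t + df)


def correlation_power(r_obs, r_crit, n):
    """Return (power, beta, z_beta) for detecting r_obs against the critical value r_crit."""
    # Fisher's Z transformation is the inverse hyperbolic tangent.
    z_beta = (math.atanh(r_obs) - math.atanh(r_crit)) * math.sqrt(n - 3)
    # Beta is the probability of a standard normal deviate this large or larger.
    beta = 1.0 - NormalDist().cdf(z_beta)
    return 1.0 - beta, beta, z_beta


n = 8                          # years of data, 1988 to 1995
df = n - 2                     # degrees of freedom for the regression slope
r_obs = r_from_t(-2.394, df)   # from the reported one-tailed t test
r_crit = r_from_t(1.943, df)   # assumed: one-tailed critical t at P=0.05, 6 df

power, beta, z_beta = correlation_power(r_obs, r_crit, n)
print(round(r_obs, 3), round(r_crit, 3), round(power, 2))
```

Under these assumptions the sketch gives a power of about 0.62, which agrees with the value reported for this example to within rounding of the tabulated inputs.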

The probability of a Type I error associated with a two-tailed test of this regression was
slightly larger than the critical value of P<0.05 (i.e., P<0.054 for the default test in
Excel®; see Table 4). However, we are only interested in a one-tailed test because we
defined a significant impact (i.e., Ha) as a non-zero decline in the catch. The
corresponding one-tailed test of the slope (i.e., t=-2.394, P=0.027) is significant at
P<0.05. Consequently, we might erroneously reject H0 only 2.7% of the time. By
contrast,  the  power  of  detecting  a  decline  of  this  magnitude  was  0.622.  As  a  result,  we 
would  correctly  conclude  that  there  was  a  significant  decline  in  the  crayfish  catch  only 
six  out  of  every  ten  times. 

From  the  power  analysis,  we  found  that  this  hypothetical  study  has  limited  power. 
Consequently,  some  refinement  may  be  necessary.  Changes  in  the  field  protocols  may 
reduce the error and thus increase the power (e.g., Hanna and Peters 1991).
Alternatively,  we  may  simply  need  several  more  years  of  data  to  gain  sufficient  power  to 
detect  a  change  of  this  magnitude  (i.e.,  increasing  power  by  increasing  the  sample 
size).  Unfortunately,  by  choosing  this  latter  option  the  crayfish  population  may  have 
effectively  disappeared  before  we  can  conclude  that  the  population  is  in  decline.  This 
point  is  illustrated  by  using  the  predicted  catch  of  0.86  in  1988  and  the  observed  slope 
of  -0.092.  The  expected  catch  will  fall  to  zero  within  ten  years  if  this  trend  continues 
(i.e., 0.86 + (10 × -0.092) < 0). As a result, not all of the options to improve the power of
a  study  may  be  tenable.   Moreover,  this  example  illustrates  why  calculating  power  is  so 
important.   If  it  takes  ten  years  of  monitoring  to  detect  an  annual  decline  of  10%  with  a 
power  of  95%,  then  the  population  will  have  disappeared  by  the  time  we  acknowledge 
that there is a problem. Similar types of problems have been identified elsewhere (e.g.,
Spooner  et  al.  1987,  Edwards  and  Perkins  1992,  Fryer  and  Nicholson  1993). 

OTHER  APPLICATIONS 

Power  analysis  estimates  the  likelihood  that  an  impact  of  a  specified  size  will  be 
detected.  By  balancing  gains  and  losses  in  power  with  the  relative  variability  and  costs 
associated  with  each  step  in  field  and  laboratory  protocols,  programs  can  be  optimized 
to  provide  suitable  power  for  a  minimal  cost  (e.g.,  Clarke  and  Green  1988,  Ferraro  et  al. 
1989).  However,  there  are  other  applications  of  power  analysis;  several  examples 
follow. 

Adding  objectivity  to  otherwise  subjective  decisions  -  Often  we  must  decide  among 
a  number  of  options  with  no  more  guidance  than  experience  and  intuition.  Power 
analysis  can  provide  a  relative  scale  to  compare  these  options  (e.g.,  Ferraro  and  Cole 
1992).  Moreover,  the  availability  of  critical  values  for  80%  and  95%  power  provides  a 
degree  of  objectivity  for  these  types  of  decisions. 

A  recent  assessment  of  a  biomonitoring  program  asked  the  question:  "What  is  the 
minimum  taxonomic  level  that  will  distinguish  the  benthic  communities  of  representative 
lakes  with  at  least  95%  power?".  This  question  was  proposed  because  one  of  the 
biggest  costs  in  this  program  was  associated  with  identifying  the  macroinvertebrates.  If 
a  coarser  taxonomic  level  provided  sufficient  power  to  distinguish  the  lakes,  then 
considerable savings would be realized (e.g., Ferraro et al. 1989, Ferraro and Cole
1992). 

In  this  study,  five  samples  were  collected  from  each  of  five  lakes  in  south-central 
Ontario  (e.g.,  Reid  et  al.  1995).  The  macroinvertebrates  were  removed  from  each 
sample,  identified  to  the  lowest  practical  taxonomic  level  (generally  the  species  level) 
and  then  counted.  The  resultant  counts  for  the  different  taxa  were  sorted  by  taxonomic 
group  and  recombined  to  provide  separate  data  sets  based  on  species,  genus,  family, 
order, class and phylum groupings. Six metrics were calculated for each data set.
These  metrics  included  the  total  number  of  taxa  (at  that  taxonomic  level),  Shannon- 
Wiener  diversity  (H'),  plus  the  first  two  ordination  axes  for  the  logarithmically 
transformed  abundance  data  (log[X+1]),  and  the  first  two  axes  from  an  ordination  of  the 
presence-absence  data  (P/A).  The  first  two  metrics  are  univariate  in  nature,  whereas 
the latter four metrics are multivariate indices of community structure.

A separate one-way analysis of variance was calculated for each data set using each of
the  six  metrics.  Each  ANOVA  distinguished  the  five  lakes  using  the  average  variation 
among  the  five  sites  in  each  lake  as  a  pooled  estimate  of  the  error.  The  resultant 
ANOVA  tables  were  used  to  calculate  phi,  the  power  statistic  (Figure  2)  and  these 
values  were  compared  to  the  tabulated  values  of  phi  associated  with  a  power  of  80% 
and 95%. Metrics providing a power of 95% were assumed to be suitable for
distinguishing  the  lakes.  Values  falling  below  80%  were  assumed  to  indicate 
unacceptable  power. 

The resultant plots revealed that all of the species-level metrics exhibited power of at
least  90-95%  (Figure  2).  The  diversity  index  had  the  lowest  power  at  this  taxonomic 
level.  By  contrast,  all  four  ordination  axes  displayed  power  values  that  were 
considerably  larger  than  the  95%  level.  Consequently,  differences  among  the  lakes  are 
not  obscured  by  variation  among  the  five  sites  within  a  lake. 

Generally  the  power  of  a  given  metric  declined  as  the  taxonomic  resolution  was 
reduced  (Figure  2).  For  example,  the  total  number  of  taxa  based  on  the  species-level 
taxonomy  had  slightly  more  than  95%  power.  The  power  of  this  metric  dropped 
gradually  as  the  taxonomic  resolution  was  reduced,  with  the  exception  of  the  value  at 
the  taxonomic  level  of  class.  Similar  declines  in  power  relative  to  taxonomic  resolution 
were  apparent  in  all  four  of  the  ordination  axes.  The  diversity  index  was  an  exception 
because  power  was  greater  at  coarser  taxonomic  levels.  This  is  due  to  smaller 
variances  around  the  mean  diversity  of  each  lake  when  the  data  were  pooled  at  coarse 
taxonomic  levels. 

For  most  of  the  metrics,  power  values  of  approximately  95%  were  obtained  at  a 
taxonomic  level  of  order  (Figure  2).  As  a  result,  the  routine  use  of  species-level 
identifications  may  be  unnecessary  in  this  study.  Considerable  savings  may  be 
obtained  because  order-level  identifications  require  limited  taxonomic  expertise  and 
virtually  no  additional  time  to  prepare  the  specimens.  The  power  analysis  provided  an 
objective  evaluation  of  a  minimum  taxonomic  resolution. 

Using power analysis to select indicators - Guidelines and criteria are routinely
developed  for  indicators.   However,  the  identification  and  selection  of  indicators  from 
long  lists  of  variables  is  difficult.  Occasionally,  indicators  are  selected  because  they  are 
correlated  with  a  particular  response,  even  though  the  underlying  cause  is  not 
understood.  More  often,  indicators  are  identified  by  modelling  the  fate  and  transport  of 
specific  contaminants.  The  following  example  illustrates  how  power  analysis  can  also 
be  used  to  select  indicators. 

A  recent  study  examined  variation  in  water  quality  within  large  lakes  in  south-central 
Ontario.  Given  an  apparent  deterioration  of  water  quality  in  some  embayments  with 
greater  urban  development,  we  proposed  the  following  question:  "What  parameters  are 
good  indicators  of  differences  in  the  water  quality  of  different  embayments  in  a  large 
lake?".  Water-quality  data  were  collected  from  six  embayments  on  the  same  lake. 
Each  embayment  was  sampled  once  each  month  during  the  ice-free  season.  Seven 
variables  were  selected  from  a  list  of  parameters  that  were  routinely  monitored. 

A  one-way  analysis  of  variance  was  used  to  compare  the  means  from  each  embayment 
for  each  variable.   Differences  between  the  embayments  were  examined  in  a  power 
analysis.   In  this  scenario,  low  power  may  arise  from  limited  variation  between 
embayments,  or  from  large  seasonal  variation  (i.e.,  error)  relative  to  the  differences 
between  embayments.  In  either  instance,  low  power  would  imply  that  a  parameter  has 
limited  utility  as  an  indicator. 

The power analysis revealed two subsets of variables (Figure 3). Total phosphorus
(TP), Secchi disk depth and chlorophyll-a concentration had less than 95% power to
detect differences among the six embayments. In fact, the power values for Secchi disk
depth and chlorophyll-a were less than 80%, indicating relatively limited power. This
observation is not surprising since these three variables exhibit marked seasonal
variation  associated  with  algal  blooms.  By  contrast,  the  power  for  conductivity 
(Cond@25),  calcium,  sodium  and  chloride  all  exceeded  95%.  Conductivity,  sodium  and 
chloride  probably  reflect  differences  in  the  amount  of  road  salt  used  around  each 
embayment  in  the  winter.  There  is  little  seasonal  variation  in  these  three  parameters. 
Similarly,  the  high  power  for  calcium  likely  indicates  that  tributaries  to  the  different 
embayments  drain  geologically  different  areas. 

These  results  also  emphasize  that  indicators  should  not  be  selected  on  the  basis  of  a 
power  analysis  alone  (Figure  3).  For  example,  conductivity,  sodium  and  chloride  all  had 
high  power.  However,  our  knowledge  of  the  sources  of  these  parameters  suggests  that 
we  may  be  measuring  winter  road-salt  use  as  a  surrogate  for  urban  development.  This 
may  or  may  not  be  acceptable.  Alternatively,  calcium  also  had  high  power,  but  it  may 
simply  reflect  the  geological  differences  of  the  catchments  draining  into  each 
embayment.  Of  the  three  remaining  variables  that  may  reflect  nutrient  inputs,  only  total 
phosphorus  had  moderate  power.  Thus  total  phosphorus  may  be  a  reasonable 
indicator  if  we  are  concerned  with  nutrient  levels.  Moreover,  the  addition  of  a 
seasonality  factor  could  provide  more  acceptable  power  for  total  phosphorus. 

Expressing power as a minimum detectable change - Herein, power has been
expressed as the probability of detecting a significant difference. As such, power is
based on alpha, the sample size and the effect size, which is expressed as the
difference  between  the  means  relative  to  the  error  in  estimating  the  means.  These 
same  power  equations  can  also  be  re-arranged  to  calculate  the  appropriate  sample 
sizes  (e.g.,  Zar  1984,  Alldredge  1987)  and  the  minimum  detectable  change  (e.g.,  Zar 
1984,  Spooner  et  al.  1987).  In  this  context,  the  minimum  detectable  change  is  simply 
the  minimum  effect  size  that  can  be  detected  with  a  specified  power. 

The  concept  of  minimum  detectable  change  has  appeal  because  it  puts  a  statistical 
attribute  on  a  scale  that  is  consistent  with  the  original  variable.  For  example,  the 
minimum  detectable  change  can  be  expressed  as  a  percentage  change  (e.g.,  see  Zar 
1984).  Thus,  we  can  ignore  the  original  measurement  scale  and  units,  as  well  as  the 
statistical  underpinnings.  However,  there  are  two  limitations  to  this  approach.  One 
problem  is  the  arbitrary  selection  of  an  appropriate  power.  As  noted  above,  there  is  no 
agreement  regarding  a  critical  threshold  for  power.  Consequently  the  minimum 
detectable  change  with  80%  power  will  be  smaller  than  the  minimum  change  with  95% 
power. Thus a probability should be provided with each minimum detectable change
even  though  this  probability  tends  to  erode  the  appeal  of  a  simple  percentage  change. 
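
As an illustration of how a power equation can be rearranged, the following Python sketch inverts the Fisher's Z calculation from the crayfish trend example to give the smallest correlation that could be detected with a specified power. It is a sketch under assumptions, not the method used in the cited sources: the adjusted degrees of freedom are taken as n - 3, the one-tailed critical t of 1.943 (6 degrees of freedom, P=0.05) is a standard tabulated value, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def r_from_t(t, df):
    """Convert a t statistic into the absolute correlation coefficient."""
    return abs(t) / math.sqrt(t * t + df)


def min_detectable_r(n, t_crit, power):
    """Smallest correlation detectable with the given power for n observations."""
    r_crit = r_from_t(t_crit, n - 2)
    # Normal deviate that leaves `power` of the distribution below it.
    z_power = NormalDist().inv_cdf(power)
    # Rearranged from: z_power = (atanh(r) - atanh(r_crit)) * sqrt(n - 3)
    return math.tanh(math.atanh(r_crit) + z_power / math.sqrt(n - 3))


# Eight years of data, one-tailed test at P=0.05 (critical t assumed from tables).
r80 = min_detectable_r(n=8, t_crit=1.943, power=0.80)
r95 = min_detectable_r(n=8, t_crit=1.943, power=0.95)
print(round(r80, 2), round(r95, 2))
```

The minimum detectable correlation is smaller at 80% power than at 95% power, which is why the chosen power needs to be reported alongside any minimum detectable change.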

The  second  limitation  has  greater  significance.  This  issue  rests  in  the  interpretation  of 
a  minimum  detectable  change.  That  is,  what  is  the  ecological  significance  of  a 
particular  minimum  detectable  change?  For  example,  will  a  15%  change  jeopardize  the 
survival of a population? Both of these questions were previously discussed in the
section  on  Unresolved  Issues.  However,  the  issue  of  ecological  significance  has 
particular  relevance  in  our  efforts  to  establish  standards  and  guidelines.  The  need  to 
resolve  these  short-comings  should  be  obvious. 

CONCLUSIONS  AND  RECOMMENDATIONS 

This  report  briefly  introduces  the  concept  of  power  analysis  and  provides  worked 
examples using Student's t test, an analysis of variance and a simple linear regression.
Power  is  the  ability  of  a  study  to  correctly  reveal  a  significant  difference  when  the  null 
hypothesis  is  false.  As  such,  power  calculations  can  be  used  to  examine  the  effects  of 
changing  alpha  (i.e.,  the  probability  of  a  Type  I  error),  beta  (i.e.,  the  probability  of  a 
Type  II  error),  the  sample  size,  or  the  effect  size.  Changes  to  the  effect  size  can  also 
include  changes  to  the  variability  (or  error)  associated  with  field  and  laboratory 
protocols.  When  the  costs  of  these  changes  are  included,  then  a  given  study  can  be 
optimized  to  provide  the  greatest  power  for  the  least  cost. 

Three  recommendations  are  proposed:  (1)  that  power  analysis  be  used  in  program 
planning  in  order  to  establish  the  ability  of  a  study  to  detect  a  significant  difference,  the 
number  of  samples  required,  and  the  minimum  detectable  change;  (2)  that  power 
analysis  calculations  use  critical  values  of  beta  equal  to  those  for  alpha  to  ensure  a 
minimum of 95% power to correctly identify a significant difference; and (3) that criteria
and  guidelines  for  ecological  significance  (rather  than  statistical  significance)  be 
established in order to evaluate the minimum detectable change.

Figure 1: Relationship between Alpha and Beta with respect to the H0 and Ha distributions and the critical F value

Figure 2: Power of different metrics with varying taxonomic resolution

Figure 3: Power of different variables as possible indicators